feat: add Prometheus metrics, Kubernetes events, and monitoring config for SeiNode/SeiNodeGroup #22

Merged
bdchatham merged 1 commit into main from
feat/controller-observability-metrics
Mar 24, 2026

@bdchatham (Collaborator) commented Mar 24, 2026

Summary

Adds comprehensive observability for the SeiNodeGroup and SeiNode controllers: Prometheus metrics, Kubernetes events, and Prometheus Operator monitoring resources (ServiceMonitor + PrometheusRule).

Metrics added

| Metric | Type | Controller |
| --- | --- | --- |
| sei_controller_seinodegroup_phase | Gauge (0/1 per phase) | SeiNodeGroup |
| sei_controller_seinodegroup_replicas | Gauge (desired/ready) | SeiNodeGroup |
| sei_controller_seinodegroup_condition | Gauge (True/False/Unknown) | SeiNodeGroup |
| sei_controller_seinodegroup_reconcile_substep_duration_seconds | Histogram | SeiNodeGroup |
| sei_controller_seinode_phase | Gauge (0/1 per phase) | SeiNode |
| sei_controller_seinode_phase_transitions_total | Counter | SeiNode |
| sei_controller_seinode_init_duration_seconds | Histogram | SeiNode |
| sei_controller_seinode_last_init_duration_seconds | Gauge | SeiNode |
| sei_controller_sidecar_request_duration_seconds | Histogram | SeiNode |
| sei_controller_sidecar_unreachable_total | Counter | SeiNode |
| sei_controller_reconcile_errors_total | Counter (shared) | Both |
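
To illustrate what "Gauge (0/1 per phase)" means for the phase metrics above, here is a minimal standard-library sketch. It emits one sample per known phase, with 1 for the current phase and 0 for the rest. The phase list and the `phaseSamples` name are assumptions made for illustration; the controller's actual phase set and its EmitPhaseGauge helper may look different.

```go
package main

import "fmt"

// phases is a hypothetical phase list used only for this sketch;
// the real SeiNode phases may differ.
var phases = []string{"Pending", "PreInitializing", "Initializing", "Running", "Failed"}

// phaseSamples returns one gauge value per known phase:
// 1 for the current phase and 0 for every other phase.
func phaseSamples(current string) map[string]float64 {
	out := make(map[string]float64, len(phases))
	for _, p := range phases {
		if p == current {
			out[p] = 1
		} else {
			out[p] = 0
		}
	}
	return out
}

func main() {
	samples := phaseSamples("Running")
	// Print in a stable order, in a shape resembling exposition output.
	for _, p := range phases {
		fmt.Printf("sei_controller_seinode_phase{phase=%q} %v\n", p, samples[p])
	}
}
```

Because every phase always has a sample, a query can select the current phase with an `== 1` filter, and no series disappears when the phase changes.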

Events

  • PhaseTransition events emitted on every SeiNode phase change via EventRecorder
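
The event emission described above can be sketched without pulling in client-go. The `eventRecorder` interface below is a simplified stand-in (an assumption for this example): the real `record.EventRecorder.Eventf` also takes the Kubernetes object as its first argument. The `recordPhaseTransition` helper, the `fakeRecorder` type, and the message format are likewise hypothetical.

```go
package main

import "fmt"

// eventRecorder is a simplified stand-in for client-go's record.EventRecorder.
// The real Eventf additionally takes the involved object as its first argument.
type eventRecorder interface {
	Eventf(eventType, reason, messageFmt string, args ...any)
}

// fakeRecorder collects formatted events in memory, similar in spirit
// to the FakeRecorder injected into test reconcilers.
type fakeRecorder struct {
	events []string
}

func (f *fakeRecorder) Eventf(eventType, reason, messageFmt string, args ...any) {
	f.events = append(f.events, eventType+" "+reason+" "+fmt.Sprintf(messageFmt, args...))
}

// recordPhaseTransition emits a PhaseTransition event only when the phase
// actually changed, and reports whether an event was emitted.
func recordPhaseTransition(rec eventRecorder, oldPhase, newPhase string) bool {
	if oldPhase == newPhase {
		return false
	}
	rec.Eventf("Normal", "PhaseTransition", "phase changed from %q to %q", oldPhase, newPhase)
	return true
}

func main() {
	rec := &fakeRecorder{}
	recordPhaseTransition(rec, "Pending", "Pending")      // no change, no event
	recordPhaseTransition(rec, "Pending", "Initializing") // emits one event
	for _, e := range rec.events {
		fmt.Println(e)
	}
}
```

Guarding on an actual change keeps repeated reconciles of a stable object from flooding the event stream.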

Infrastructure

  • Metrics endpoint switched from HTTPS :8443 to HTTP :8080
  • Container port declaration, Service, and NetworkPolicy updated accordingly
  • ServiceMonitor for Prometheus Operator scraping (30s interval)
  • PrometheusRule with 7 alerts: SeiNodeGroupDegraded, SeiNodeGroupFailed, SeiNodeStuckInitializing, SeiNodeStuckPending, SidecarUnreachableHigh, ControllerReconcileErrors, ControllerHighReconcileLatency

Design decisions

  • Phase gauges use the kube-state-metrics 0/1 pattern for PromQL compatibility
  • name label omitted from transition counter and sidecar histogram to control cardinality
  • InitBuckets (10s–1h) used for node init durations vs ReconcileBuckets for substeps
  • HTTP status codes normalized to class buckets (2xx, 4xx, etc.) to bound cardinality
  • Shared observability package centralises helpers and cross-controller metrics
  • SeiNodeStuckPending alert uses max by() aggregation so that the alert's for timer does not reset when a node moves between Pending and PreInitializing
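
As a sketch of the status-code normalization mentioned in the list above, the function below collapses an HTTP status code into its class label, so a metric label can take only a handful of values. The name `statusClass` and the "unknown" fallback for out-of-range codes are assumptions; the PR's NormalizeStatusCode helper may behave differently.

```go
package main

import "fmt"

// statusClass maps an HTTP status code to its class label ("2xx", "4xx", ...).
// Codes outside 100-599 map to "unknown" (an assumption for this sketch).
func statusClass(code int) string {
	if code < 100 || code > 599 {
		return "unknown"
	}
	return fmt.Sprintf("%dxx", code/100)
}

func main() {
	for _, code := range []int{200, 204, 404, 503, 0} {
		fmt.Println(code, statusClass(code))
	}
}
```

With this mapping the label has at most six distinct values, regardless of how many individual status codes the sidecar returns.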

Test plan

  • make test passes (FakeRecorder injected in all test reconcilers)
  • Deploy to dev cluster and verify metrics appear at /metrics
  • Confirm ServiceMonitor is picked up by Prometheus (kubectl get servicemonitor -A)
  • Confirm PrometheusRule alerts appear in Prometheus UI
  • Trigger a phase transition and verify event in kubectl describe seinode
  • Verify Grafana can query all new metrics

Introduce comprehensive observability for SeiNodeGroup and SeiNode
controllers:

Metrics:
- SeiNodeGroup: phase gauge, replicas gauges, condition gauges,
  reconcile substep duration histogram, reconcile error counter
- SeiNode: phase gauge, phase transition counter, init duration
  histogram + last-init gauge, sidecar request duration histogram,
  sidecar unreachable counter, reconcile error counter
- Shared observability package with InitBuckets, NormalizeStatusCode,
  EmitPhaseGauge/DeletePhaseGauge helpers, and centralised
  ReconcileErrorsTotal counter
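
For a sense of what a bucket set spanning the 10s to 1h range named in the design decisions might look like, here is a sketch that builds geometrically spaced histogram boundaries. The doubling factor and the `durationBuckets` helper are assumptions; the actual InitBuckets values are not shown in this PR description.

```go
package main

import "fmt"

// durationBuckets returns upper bounds (in seconds) that start at min,
// grow by factor, and end exactly at max. It returns nil for inputs that
// would not produce an increasing sequence.
func durationBuckets(min, max, factor float64) []float64 {
	if min <= 0 || max <= min || factor <= 1 {
		return nil
	}
	var buckets []float64
	for v := min; v < max; v *= factor {
		buckets = append(buckets, v)
	}
	return append(buckets, max)
}

func main() {
	// Hypothetical init-duration buckets: 10s doubling up to a 1h cap.
	fmt.Println(durationBuckets(10, 3600, 2))
}
```

Wide, coarse buckets like these suit node initialization, which can take minutes, while reconcile substeps would typically need much finer sub-second boundaries.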

Events:
- Kubernetes events on SeiNode phase transitions via EventRecorder
- RBAC and FakeRecorder wiring in tests

Infrastructure:
- Switch metrics endpoint from HTTPS :8443 to HTTP :8080
- Add container port declaration and update Service + NetworkPolicy
- ServiceMonitor for Prometheus Operator scraping
- PrometheusRule with 7 alerts (degraded/failed groups, stuck nodes,
  sidecar unreachable, reconcile errors, high latency)
- Wire monitoring resources into config/default kustomization

Made-with: Cursor
@bdchatham bdchatham merged commit a5f797f into main Mar 24, 2026
2 checks passed